The boosting paradigm allows the learner to have smooth control over this tradeoff. The learning starts with a basic class (that might have a large approximation error), and as it progresses the class that the predictor may belong to grows richer.
A boosting algorithm amplifies the accuracy of weak learners. When a weak learner can be implemented efficiently, boosting provides a tool for aggregating such weak hypotheses to gradually approximate good predictors for larger, harder-to-learn classes.
We will derive AdaBoost as a so-called "steepest descent" method (closely related to gradient descent). Last time, we showed that
$$\bar g = \operatorname*{argmin}_{g:\,S\to\mathbb{R}}\ \mathbb{E}\left[\exp(-Yg(X))\right]$$
satisfies $\bar g(x) = \frac{1}{2}\log\frac{1+\eta(x)}{1-\eta(x)}$, and that $\operatorname{sign}(\bar g(x)) = g_*(x)$ is the Bayes classifier.
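For completeness, here is the short pointwise computation behind this claim, assuming (as the displayed formula suggests) that $\eta(x) = \mathbb{E}[Y\,|\,X=x]$, so that $P(Y=1\,|\,X=x) = \frac{1+\eta(x)}{2}$: conditioning on $X=x$ and writing $u = g(x)$,
$$\mathbb{E}\left[e^{-Yg(X)}\,\middle|\,X=x\right] = e^{-u}\,\frac{1+\eta(x)}{2} + e^{u}\,\frac{1-\eta(x)}{2},$$
and setting the derivative in $u$ to zero gives $e^{2u} = \frac{1+\eta(x)}{1-\eta(x)}$, i.e. $u = \frac{1}{2}\log\frac{1+\eta(x)}{1-\eta(x)}$.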
Next, we will consider the empirical (based on the training data) analogue of this minimization problem: given a class G of functions (not necessarily binary classifiers), find the one that minimizes
$$g \mapsto \frac{1}{n}\sum_{j=1}^n \exp\left(-Y_j g(X_j)\right).$$
We will call the minimizer $\hat g_n$. The specific class of functions we will be looking at is constructed as follows: let $F$ be a simple "base class" consisting of binary classifiers (e.g., threshold classifiers or linear separators), and set
$$G = \left\{\, \sum_{j=1}^k \alpha_j f_j \;:\; k\ge 1,\ \alpha_1,\dots,\alpha_k\in\mathbb{R},\ f_1,\dots,f_k\in F \right\}.$$
Up to normalization of the weights, this class is closely related to the "convex hull" of $F$, which consists of all convex combinations of the elements of $F$. Another way to think about $g=\sum_{j=1}^k \alpha_j f_j$ is that the sign of $g$ is the "majority vote" among $f_1,\dots,f_k$, where the "vote" of $f_j$ carries weight $\alpha_j$.
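As a small, purely illustrative sketch (the base class of threshold classifiers, the toy data, and the function names below are illustrative choices, not taken from the notes), the following Python code evaluates such a weighted vote and the empirical exponential loss it incurs:

```python
import numpy as np

def threshold_classifier(t):
    """One possible base classifier: f(x) = +1 if x >= t, else -1."""
    return lambda x: np.where(x >= t, 1.0, -1.0)

def weighted_vote(alphas, fs, x):
    """g(x) = sum_j alpha_j * f_j(x); the predicted label is sign(g(x))."""
    return sum(a * f(x) for a, f in zip(alphas, fs))

def empirical_exp_loss(alphas, fs, X, Y):
    """(1/n) * sum_j exp(-Y_j * g(X_j)), the empirical exponential loss."""
    return np.mean(np.exp(-Y * weighted_vote(alphas, fs, X)))

# Toy data: labels essentially given by the sign of x, plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
Y = np.sign(X + 0.1 * rng.normal(size=200))

fs = [threshold_classifier(t) for t in (-0.5, 0.0, 0.5)]
alphas = [0.3, 1.0, 0.3]
print(empirical_exp_loss(alphas, fs, X, Y))
```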
It turns out that this problem admits an efficient solution under a rather weak assumption on the class F, called the "weak learnability" condition.
We will say that a class $F$ of binary classifiers satisfies the weak learnability condition if for any collection of points $(x_j,y_j)\in S\times\{\pm 1\}$, $j=1,\dots,n$, $n\ge 1$, and any nonnegative weights $w_1,\dots,w_n$ with $\sum_{j=1}^n w_j = 1$, there exists $f\in F$ such that $\sum_{j=1}^n w_j I(y_j \ne f(x_j)) \le 1/2$ (in simple terms, $f$ does at least as well as a random guess).
Note that weak learnability is implied by symmetry of the class, meaning that $f\in F \iff -f\in F$; indeed, if $\sum_{j=1}^n w_j I(y_j \ne f(x_j)) > \frac{1}{2}$, then $\sum_{j=1}^n w_j I(y_j \ne -f(x_j)) < \frac{1}{2}$.
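The last implication is just the identity
$$\sum_{j=1}^n w_j I\bigl(y_j \ne -f(x_j)\bigr) = \sum_{j=1}^n w_j \bigl(1 - I(y_j \ne f(x_j))\bigr) = 1 - \sum_{j=1}^n w_j I\bigl(y_j \ne f(x_j)\bigr),$$
which uses the fact that $f$ takes values in $\{\pm 1\}$, so $y_j \ne -f(x_j)$ exactly when $y_j = f(x_j)$.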
Let's examine one step of (a version of) the so-called "steepest descent" algorithm for minimizing $\frac{1}{n}\sum_{j=1}^n \exp(-Y_j g(X_j))$. Steepest descent is essentially a version of the gradient descent method in which we use an "approximate" gradient at each step instead of the true gradient.
Assume that $g\in G$ is our current guess. We will look for $\alpha\in\mathbb{R}$ and $f\in F$ that (approximately) minimize
$$(\alpha,f) \mapsto \frac{1}{n}\sum_{j=1}^n \exp\bigl(-Y_j\,(g(X_j)+\alpha f(X_j))\bigr).$$
Intuitively, such an $f$ - the direction of "steepest descent" - can be seen as an approximate gradient. Define the weights $w_j = \frac{1}{n} e^{-Y_j g(X_j)}$, $j=1,\dots,n$, and note that $w_j\ge 0$. Let $\tilde w_j = \frac{w_j}{\sum_{i=1}^n w_i}$, so that $\sum_{j=1}^n \tilde w_j = 1$. Our problem is then to minimize $\sum_{j=1}^n \tilde w_j e^{-\alpha Y_j f(X_j)}$ over $f\in F$, $\alpha\in\mathbb{R}$. Since $f$ takes only the two values $\pm 1$, we have that
$$\sum_{j=1}^n \tilde w_j e^{-\alpha Y_j f(X_j)} = e^{-\alpha}\sum_{j:\,Y_j=f(X_j)} \tilde w_j + e^{\alpha}\sum_{j:\,Y_j\ne f(X_j)} \tilde w_j = e^{-\alpha} + \left(e^{\alpha}-e^{-\alpha}\right) e_{n,\tilde w}(f),$$
where $e_{n,\tilde w}(f) = \sum_{j=1}^n \tilde w_j I(Y_j \ne f(X_j))$ is the "weighted" training error. To minimize the resulting expression, we proceed in two steps:
1. Minimize $\sum_{j=1}^n \tilde w_j I(Y_j \ne f(X_j))$ with respect to $f\in F$;
2. Minimize $e^{-\alpha} + \left(e^{\alpha}-e^{-\alpha}\right) e_{n,\tilde w}(f)$ with respect to $\alpha$.
The "weak learnability" assumption is needed to complete step 1: recall that for any nonnegative weights w~1,…,w~n with ∑j=1nw~j=1,∃f∈F such that en,w~(f)≤21. We will assume access to a "black box" weak learning algorithm that takes w~1,…,w~n and (X1,Y1),…,(Xn,Yn) as inputs and outputs some f∈F such that en,w~(f)≤21.
Exercise. Describe such a "weak learning" algorithm for the class consisting of
threshold classifiers $g_t^{+}$ and $g_t^{-}$ (one possible approach for this case is sketched right after this exercise);
linear separators (e.g., you could rely on logistic regression).
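Here is one possible sketch of such a "weak learning" black box for the threshold case, as a hedged illustration: the exhaustive search over data-driven thresholds, the inclusion of both orientations (presumably corresponding to $g_t^{+}$ and $g_t^{-}$), and all names below are illustrative choices rather than anything prescribed in the notes.

```python
import numpy as np

def weak_learner_thresholds(X, Y, w):
    """Exhaustive search over thresholds t (taken from the data, plus -inf) and
    orientations s in {+1, -1} for the classifier x -> s if x > t else -s,
    minimizing the weighted error sum_j w_j * I(Y_j != prediction_j)."""
    best_err, best_t, best_s = np.inf, None, None
    for t in np.concatenate(([-np.inf], np.sort(X))):
        base = np.where(X > t, 1.0, -1.0)
        for s in (1.0, -1.0):
            err = np.sum(w * (Y != s * base))
            if err < best_err:
                best_err, best_t, best_s = err, t, s
    return best_err, best_t, best_s

# Usage on toy data with uniform weights:
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=50)
Y = np.sign(X)
w = np.full(50, 1.0 / 50)
print(weak_learner_thresholds(X, Y, w))
```

Since both orientations $\pm$ are searched over, the returned weighted error is at most $1/2$, by the same symmetry argument as above.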
Assuming that $e_{n,\tilde w}(f)\le\frac{1}{2}$, the minimum of $e^{-\alpha} + \left(e^{\alpha}-e^{-\alpha}\right) e_{n,\tilde w}(f)$ occurs for
$$\hat\alpha = \frac{1}{2}\log\frac{1-e_{n,\tilde w}(f)}{e_{n,\tilde w}(f)} \ge 0.$$
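To see where this expression comes from, write $\varepsilon = e_{n,\tilde w}(f)$ and set the derivative in $\alpha$ to zero:
$$\frac{d}{d\alpha}\left[e^{-\alpha}(1-\varepsilon) + e^{\alpha}\varepsilon\right] = -e^{-\alpha}(1-\varepsilon) + e^{\alpha}\varepsilon = 0 \iff e^{2\alpha} = \frac{1-\varepsilon}{\varepsilon},$$
so $\hat\alpha = \frac{1}{2}\log\frac{1-\varepsilon}{\varepsilon}$, which is nonnegative precisely because $\varepsilon\le\frac{1}{2}$.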
AdaBoost is the algorithm that repeats the steps outlined above. Initialize the weights as $w_j^{(1)} = \frac{1}{n}$, $j=1,\dots,n$, and for $t=1,\dots,T$ repeat the following:
1. Run the weak learning algorithm with the weights $w^{(t)} = (w_1^{(t)},\dots,w_n^{(t)})$ to obtain $f_t\in F$ with $e_{n,w^{(t)}}(f_t)\le\frac{1}{2}$;
2. Set $\alpha_t = \frac{1}{2}\log\frac{1-e_{n,w^{(t)}}(f_t)}{e_{n,w^{(t)}}(f_t)}$;
3. Update the weights: $w_j^{(t+1)} = \frac{w_j^{(t)}\, e^{-\alpha_t Y_j f_t(X_j)}}{Z_t}$, $j=1,\dots,n$, where $Z_t = \sum_{j=1}^n w_j^{(t)} e^{-\alpha_t Y_j f_t(X_j)}$ is the normalizing factor.
After $T$ rounds, output $g_T = \sum_{t=1}^T \alpha_t f_t$; the associated classifier is $\operatorname{sign}(g_T)$.
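A minimal Python sketch of this loop, assuming the weak learner is supplied as a black-box function returning a base classifier (the interface, the guard against zero error, and all names are my own illustrative choices):

```python
import numpy as np

def adaboost(X, Y, weak_learner, T):
    """Sketch of the boosting loop: returns coefficients alpha_t and base
    classifiers f_t, so that g_T(x) = sum_t alpha_t * f_t(x).

    Assumes weak_learner(X, Y, w) returns a callable f with values in {-1, +1}
    whose weighted error sum_j w_j * I(Y_j != f(X_j)) is at most 1/2."""
    n = len(Y)
    w = np.full(n, 1.0 / n)                        # w_j^{(1)} = 1/n
    alphas, fs = [], []
    for _ in range(T):
        f = weak_learner(X, Y, w)
        err = max(np.sum(w * (f(X) != Y)), 1e-12)  # e_{n,w^{(t)}}(f_t), guarded away from 0
        alpha = 0.5 * np.log((1 - err) / err)      # alpha_t
        unnormalized = w * np.exp(-alpha * Y * f(X))
        Z = unnormalized.sum()                     # normalizing factor Z_t
        w = unnormalized / Z                       # w_j^{(t+1)}
        alphas.append(alpha)
        fs.append(f)
    return alphas, fs

def predict(alphas, fs, x):
    """Weighted majority vote: sign of g_T(x) = sum_t alpha_t * f_t(x)."""
    return np.sign(sum(a * f(x) for a, f in zip(alphas, fs)))
```

For instance, it could be combined with a wrapper around the threshold weak learner sketched earlier that returns the selected classifier as a callable.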
a) Note that $w_j^{(T+1)} = \frac{1}{n\prod_{t=1}^T Z_t}\, e^{-Y_j\sum_{t=1}^T \alpha_t f_t(X_j)}$; this is easy to show by induction, and the details will be filled in during the lecture.
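For reference, the induction step reads
$$w_j^{(T+2)} = \frac{w_j^{(T+1)}\, e^{-\alpha_{T+1} Y_j f_{T+1}(X_j)}}{Z_{T+1}} = \frac{1}{n\prod_{t=1}^{T+1} Z_t}\, e^{-Y_j\sum_{t=1}^{T+1}\alpha_t f_t(X_j)},$$
while the base case is just the initialization $w_j^{(1)} = \frac{1}{n}$.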
b) We have that
$$Z_t = \sum_{j=1}^n w_j^{(t)} e^{-\alpha_t Y_j f_t(X_j)} = e^{-\alpha_t} + \left(e^{\alpha_t}-e^{-\alpha_t}\right)\sum_{j=1}^n w_j^{(t)} I\bigl(Y_j \ne f_t(X_j)\bigr),$$
where the last multiplicand is $e_{n,w^{(t)}}(f_t)$. Recalling that $\alpha_t = \frac{1}{2}\log\frac{1-e_{n,w^{(t)}}(f_t)}{e_{n,w^{(t)}}(f_t)}$, we thus have that
$$Z_t = 2\sqrt{e_{n,w^{(t)}}(f_t)\left(1-e_{n,w^{(t)}}(f_t)\right)}.$$
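Indeed, writing $\varepsilon_t = e_{n,w^{(t)}}(f_t)$, so that $e^{\alpha_t} = \sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}}$,
$$Z_t = e^{-\alpha_t}(1-\varepsilon_t) + e^{\alpha_t}\varepsilon_t = \sqrt{\tfrac{\varepsilon_t}{1-\varepsilon_t}}\,(1-\varepsilon_t) + \sqrt{\tfrac{1-\varepsilon_t}{\varepsilon_t}}\,\varepsilon_t = 2\sqrt{\varepsilon_t(1-\varepsilon_t)}.$$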
d) The function $x\mapsto\sqrt{x(1-x)}$, $x\in[0,\frac{1}{2}-\gamma]$, is maximized at $x=\frac{1}{2}-\gamma$; thus, if the weak learner always returns $f_t$ with $e_{n,w^{(t)}}(f_t)\le\frac{1}{2}-\gamma$ for some $\gamma>0$, then
$$Z_t \le 2\sqrt{\left(\tfrac{1}{2}-\gamma\right)\left(\tfrac{1}{2}+\gamma\right)} = \sqrt{1-4\gamma^2} \le \sqrt{e^{-4\gamma^2}} = e^{-2\gamma^2},$$
since $1-x\le e^{-x}$ for $x\in[0,1]$. Therefore,
$$\frac{1}{n}\sum_{j=1}^n I\{Y_j g_T(X_j)\le 0\} \le \frac{1}{n}\sum_{j=1}^n e^{-Y_j g_T(X_j)} = \prod_{t=1}^T Z_t \le \exp(-2T\gamma^2),$$
where the equality in the middle follows from a), since the weights $w_j^{(T+1)}$ sum to one.
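As a quick worked example with illustrative numbers (not from the notes): for $\gamma=0.1$ the bound reads $\exp(-0.02\,T)$; moreover, since the training error is a multiple of $\frac{1}{n}$, it equals $0$ exactly as soon as
$$\exp(-2T\gamma^2) < \frac{1}{n}, \qquad\text{i.e.}\qquad T > \frac{\log n}{2\gamma^2}.$$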
In conclusion, the training error goes to $0$ exponentially fast. However, the main object of interest is the "generalization error"
$$P\bigl(Y g_T(X)\le 0\bigr).$$
Fortunately, in this case the generalization error is also going to be small: if the VC dimension of the base class $F$ is $V$, then, with high probability, the generalization error is at most $K\sqrt{\frac{V}{n}}$ larger than the training error, where $K$ is some fixed constant. The proof of this fact is rather difficult, however, and won't be covered in this course.